The Chicago Department of Public Health's Food Protection Program provides a database containing the information from inspection reports of restaurants and other food establishments in Chicago from 2010 to the present. It contains a lot of information about the establishments, such as their type of facility (grocery store, restaurant, coffee shop, ...) and their location. The database also provides details about the violations listed, such as the description of the findings that caused them and the reason that led the program's staff to conduct an inspection.
The main research question is: Where can you eat safely in Chicago?
The adverb "safely" is chosen deliberately because it has multiple meanings: you can eat safely in the sense that the establishment you go to respects particular hygiene rules, but also in the sense that the place is safe according to the crime rate of its district.
To begin, it has been decided that the data will be analyzed over time. Indeed, the restaurant industry is known to be in constant evolution: each year, many establishments open while others close or change owners, especially in big cities. This is why the different calculations and comparisons will be done by year. To answer the main question about where to eat in Chicago, the answer will be based on the latest data. Also, a map visualization seems to be a good way to display the results, as the analysis is spatio-temporal.
The hygiene of the public and private food establishments listed in the dataset will be measured using a hygiene score based on the different violations reported and the final result of the inspection. In order to compare the places to eat by delimited zones, a mean will be computed by community area.
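In pandas terms, the per-area aggregation can be sketched like this (the scores below are made-up placeholders; the real computation appears later in the notebook):

```python
import pandas as pd

# Hypothetical inspection scores; the column names mirror the ones used
# later in the notebook, but the values are invented for illustration.
inspections = pd.DataFrame({
    'Community Area': [8.0, 8.0, 32.0, 32.0, 32.0],
    'InspectionScore': [27, 3, 8, 0, 16],
})

# Mean hygiene score per community area
mean_scores = inspections.groupby('Community Area')['InspectionScore'].mean()
print(mean_scores)
```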
Questions:
Are there significant differences in hygiene scores between the community areas?
How do the scores evolve over time?
Two datasets have been found which contain the owner and license information of Chicago establishments. It would be interesting to merge them with the first dataset in order to study trends regarding the management of the establishments.
Questions:
Is there a relation between the number of establishments an owner has and the hygiene scores obtained?
In which way does a change of owner affect the hygiene score of an establishment?
The Chicago Crime Report can be used to compute a crime score for each community area, following the same principle as in the first point. This way, it will be possible to compare the crime and hygiene scores across community areas.
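As a minimal sketch of this principle (assuming, as in the real Chicago crime data, that each record carries a Community Area field; the records below are invented):

```python
import pandas as pd

# Invented crime records, for illustration only
crimes = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Community Area': [8.0, 8.0, 32.0, 8.0, 32.0],
})

# A simple crime score: the number of reported crimes per community area
crime_score = crimes.groupby('Community Area')['ID'].count().rename('CrimeScore')
print(crime_score)
```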
Questions:
Criminology domain
Is there a relation between the crime score and the hygiene scores obtained in the same community area?
An important point is to pay attention to the number of inspected establishments compared to the total number of establishments. It is certain that the variations of this ratio between the community areas have an impact on the results. An explanation of these variations should be proposed. In order to give a complete answer to the main research question, the uninspected establishments have to be taken into account.
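The inspected-to-total ratio mentioned here could be computed along these lines (the counts and column names are hypothetical; the totals would have to come from a separate business-license dataset):

```python
import pandas as pd

# Hypothetical per-area counts, for illustration
coverage = pd.DataFrame({
    'Community Area': [8.0, 32.0],
    'inspected': [120, 45],   # establishments appearing in the inspections data
    'total': [200, 150],      # all establishments in the area
})
coverage['InspectedRatio'] = coverage['inspected'] / coverage['total']
print(coverage)
```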
Hygiene Rules : Food Code Rules
Community Area : Chicago Community Area separation (available on the site of the City of Chicago)
The purpose of the project is to help consumers easily choose where to eat in Chicago and to provide them with an interactive and intuitive way to browse the different places offered to them. It could also help the Chicago Department of Public Health's Food Protection Program adapt their methods according to the situation described by the findings of the analysis (for example, if a prevention program should be proposed for a specific area or type of facility).
%matplotlib inline
import pandas as pd
import numpy as np
import re
from matplotlib.ticker import MaxNLocator
import matplotlib.pyplot as plt
import requests as req
from bs4 import BeautifulSoup
import seaborn as sns
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
import folium
import os
import json
from folium.plugins import MarkerCluster
First a quick look at how the dataset is organized.
eat_seat = pd.read_csv('food-inspections.csv',sep=',') #creation of the dataframe
eat_seat.head(3)
eat_seat.info()
The dataset contains :
To begin, the first step is to check whether there are duplicates and, if so, decide how to handle them.
duplicates = eat_seat[eat_seat.duplicated(subset=None, keep=False)]
len(duplicates)
It is not realistic for two rows to contain the exact same cells: two inspections cannot happen at the same time, so one must be a duplicate of the other. Based on this postulate, we can drop the duplicates.
eat_seat = eat_seat.drop_duplicates()
len(eat_seat)
Now, we want to see which values are missing.
eat_seat.isnull().sum()
preview = pd.DataFrame(eat_seat.isnull().sum()).plot.barh(title = 'Missing Values by Columns')
As we can see, the columns Historical Wards 2003-2015, Zip Codes, Community Areas, Census Tracts and Wards are empty and will be dropped.
We will only be using the DBA Name (the name under which the establishment does business; DBA = doing business as), so we drop the AKA Name column too.
Also, we will use the Latitude and Longitude in our analysis, so we can drop the Location, Zip, City and State columns too. The Address column is kept because it can help us notice some inconsistencies.
eat_seat = eat_seat.drop(columns = ['AKA Name','Historical Wards 2003-2015', 'Zip Codes', 'Community Areas',\
'Census Tracts', 'Location', 'Wards', 'City', 'State', 'Zip'])
eat_seat.head(2)
In order to standardize the data, the numbers in the License column are converted to integers.
eat_seat['License #'] = pd.to_numeric(eat_seat['License #'], downcast= 'integer')
def nbestethic(nb):
    try:
        return int(nb)
    except (TypeError, ValueError):
        print(nb)  # show the values that cannot be cast
        return nb
eat_seat['License #'] = eat_seat['License #'].apply(nbestethic)
eat_seat.head(2)
There are also adjustments to make in the following columns, because the formats can be optimized :
Inspection Date: only the day will be kept (the time of day is actually not given)
Risk: only the number will remain
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x: x.split('T')[0])
from datetime import datetime, date, time
eat_seat['Inspection Date'] = eat_seat['Inspection Date'].apply(lambda x:datetime.strptime(x, '%Y-%m-%d'))
def inspection_per_year(dblp):
    return dblp[['Inspection Date', 'Inspection ID']].groupby('Inspection Date').count()
ax = inspection_per_year(eat_seat).plot(rot=-45)
ax.set_xlabel('Year')
ax.set_ylabel('#Inspections')
len(eat_seat[eat_seat['Inspection Date'].isnull()])
We see in this graph that all the Inspection Dates are between 2010 and 2019, which makes sense. Also, there are no missing inspection dates. This column is now cleaned.
For the Risk column, we first want to check what types of risk are listed.
eat_seat.Risk.unique()
We will replace All and High Risk by 3, Medium Risk by 2 and Low Risk by 1.
eat_seat['Risk'] = eat_seat['Risk'].replace({'All':3, 'Risk 1 (High)':3, 'Risk 2 (Medium)':2, 'Risk 3 (Low)':1})
eat_seat['Risk'] = pd.to_numeric(eat_seat['Risk'], downcast= 'integer')
eat_seat.head(3)
def inspection_per_risk(dblp):
    return dblp[['Risk', 'Inspection ID']].groupby('Risk').count()
ax = inspection_per_risk(eat_seat).plot(kind="barh")
ax.set_ylabel('Risk')
ax.set_xlabel('#Inspections')
len(eat_seat[eat_seat['Risk'].isnull()])
We see in this graph that all the Risk values are between 1 and 3. There are, for now, 73 missing values, but as this is not a problem at this stage, we will just keep it in mind.
Then, we want to put the different facility types in categories that make sense for our project.
First, we thought about selecting only the "public" establishments, where it is possible to eat a main course (for example, places where you can only eat an ice cream are removed from our list). They are all categorized in order to be compared with each other.
public_dic = {'restaurant' : ['Restaurant', 'DINING HALL', 'TENT RSTAURANT'], \
'grocery_restaurant' : ['Grocery & Restaurant', 'GROCERY& RESTAURANT', 'GROCERY/RESTAURANT',\
'GROCERY/ RESTAURANT', 'GROCERY STORE/ RESTAURANT', 'GROCERY & RESTAURANT',\
'RESTAURANT/GROCERY', 'grocery & restaurant', 'RESTAURANT/GROCERY STORE',\
'GROCERY/TAQUERIA', 'GAS STATION/RESTAURANT'],\
'banquet' : ['LOUNGE/BANQUET HALL', 'BANQUET', 'Banquet Hall', 'BANQUET FACILITY', 'banquet hall',\
'banquets', 'Banquet Dining', 'Banquet/kitchen','RESTAURANT.BANQUET HALLS',\
'BANQUET HALL', 'Banquet', 'BOWLING LANES/BANQUETS'], \
'rooftop_restaurant' : ['Wrigley Roof Top', 'REST/ROOFTOP'],\
'bar_restaurant' : ['RESTAURANT/BAR', 'RESTUARANT AND BAR', 'BAR/GRILL', 'RESTAURANT/BAR/THEATER',\
'JUICE AND SALAD BAR', 'SUSHI COUNTER', 'TAVERN/RESTAURANT', 'tavern/restaurant',\
'TAVERN GRILL'], \
'bakery_restaurant' : ['BAKERY/ RESTAURANT', 'bakery/restaurant', 'RESTAURANT/BAKERY'], \
'liquor_restaurant' : ['RESTAURANT AND LIQUOR', 'RESTAURANT/LIQUOR'], \
'catering' : ['CATERING/CAFE', 'Catering'], \
'golden_diner' : ['Golden Diner']}
facilitytype = 'BANQUET'
len(eat_seat[eat_seat['Facility Type'] == facilitytype])
This command returns the number of occurrences of the Facility Type given as input.
By trying different types previously categorized and listed in the public_dic dictionary, we noted that the results were too sparse to conduct a meaningful analysis. That is why we then decided to also select "private" establishments like school cafeterias and hospitals. It could be interesting to compare private and public inspection results.
private_dic = {'daycare' : ['Daycare Above and Under 2 Years', 'Daycare (2 - 6 Years)', 'Daycare Combo 1586',\
'Daycare (Under 2 Years)', 'DAYCARE 2 YRS TO 12 YRS', 'Daycare Night', 'DAY CARE 2-14',\
'Daycare (2 Years)', 'DAYCARE', 'ADULT DAYCARE', '15 monts to 5 years old', 'youth housing',\
'DAYCARE 1586', 'DAYCARE COMBO', '1584-DAY CARE ABOVE 2 YEARS', 'CHURCH/DAY CARE', 'DAY CARE',\
'DAYCARE 6 WKS-5YRS', 'DAY CARE 1023', 'DAYCARE 2-6, UNDER 6', 'Day Care Combo (1586)'], \
'school' : ['SCHOOL', 'School', 'PRIVATE SCHOOL', 'AFTER SCHOOL PROGRAM', 'COLLEGE',\
'BEFORE AND AFTER SCHOOL PROGRAM', 'Private School', 'TEACHING SCHOOL',\
'PUBLIC SHCOOL', 'CHARTER SCHOOL CAFETERIA', 'CAFETERIA', 'Cafeteria', 'cafeteria',\
'UNIVERSITY CAFETERIA', 'PREP INSIDE SCHOOL', 'CHARTER SCHOOL', 'school cafeteria',\
'CHARTER SCHOOL/CAFETERIA', 'School Cafeteria', 'ALTERNATIVE SCHOOL', 'CITY OF CHICAGO COLLEGE',\
'after school program', 'CHURCH/AFTER SCHOOL PROGRAM', 'AFTER SCHOOL CARE'], \
'childrens_services' : ["Children's Services Facility", 'CHILDRENS SERVICES FACILITY', \
"CHILDERN'S SERVICE FACILITY", "1023 CHILDREN'S SERVICES FACILITY", \
"1023 CHILDERN'S SERVICES FACILITY", "1023-CHILDREN'S SERVICES FACILITY", \
"1023 CHILDERN'S SERVICE FACILITY", "1023 CHILDERN'S SERVICE S FACILITY", \
'CHILDERN ACTIVITY FACILITY', "CHILDERN'S SERVICES FACILITY", '1023'], \
'adultcare' : ['Long Term Care', 'REHAB CENTER', 'Hospital', 'ASSISTED LIVING', 'SENIOR DAY CARE',\
'Assisted Living', 'NURSING HOME', 'ASSISTED LIVING FACILITY', 'SUPPORTIVE LIVING FACILITY',\
'Assisted Living Senior Care', 'Adult Family Care Center', '1005 NURSING HOME', \
'Long-Term Care Facility', 'LONG TERM CARE FACILITY', 'ASSISSTED LIVING',\
'Long-Term Care','Long Term Care Facility', 'VFW HALL']}
total_dic = {**public_dic , **private_dic}
#inverting the dict: map each raw facility label to its group
facilities = {}
for key in total_dic:
    for facility in total_dic[key]:
        facilities[facility] = key
facilitygroups = pd.DataFrame(data = facilities.values(), index=facilities.keys(), columns = ['FacilityGroup'])
facilitygroups.head(3)
We create the facilitygroups DataFrame; next we will merge it with the eat_seat DataFrame.
facilitygroups.index.name = 'Facility Type'
Changing the index name to merge the two DataFrames on it
eat_seat = pd.merge(eat_seat, facilitygroups, on = ['Facility Type'])
eat_seat.head(3)
Merging the two DataFrames on the Facility Type column gives us the FacilityGroup column.
def inspection_per_facility(dblp):
    return dblp[['FacilityGroup', 'Inspection ID']].groupby('FacilityGroup').count()
ax = inspection_per_facility(eat_seat).plot(kind="barh")
ax.set_ylabel('Facility Type')
ax.set_xlabel('#Inspections')
public_private = {'daycare' : 'Private', 'school' : 'Private', 'childrens_services' : 'Private', 'adultcare' : 'Private',\
'restaurant' : 'Public', 'grocery_restaurant' : 'Public', 'banquet' : 'Public', 'rooftop_restaurant' : 'Public',\
'bar_restaurant' : 'Public', 'bakery_restaurant' : 'Public', 'liquor_restaurant' : 'Public' , 'catering' : 'Public', 'golden_diner' : 'Public'}
This public_private dictionary maps each facility group to its global type (public vs private).
public_private_df = pd.DataFrame(data = public_private.values(), index=public_private.keys(), columns = ['Public_Private'])
public_private_df.head(5)
Creating the public_private_df DataFrame, to be merged next with the eat_seat DataFrame
public_private_df.index.name = 'FacilityGroup'
Renaming the index to merge on it
eat_seat = pd.merge(eat_seat, public_private_df, on = ['FacilityGroup'])
eat_seat.head(3)
Merging the public_private_df and the eat_seat on FacilityGroup to create the Public_Private column
There are also missing license numbers, corresponding to the nan values printed in the output of the "Type casting" subsection. We also see in the graph below that some license numbers are equal to 0, which is not a valid license number.
def inspection_per_license(dblp):
    return dblp[['License #', 'Inspection ID']].groupby('License #').count()
ax = inspection_per_license(eat_seat).plot(rot=-45)
ax.set_xlabel('License #')
ax.set_ylabel('#Inspections')
The following dataframe displays the establishments where the license number is missing or equal to 0.0.
missing = eat_seat[eat_seat['License #'].isnull() | (eat_seat['License #'] == 0.0)]
missing.head()
len(missing)
In order to resolve this problem, we check whether there are other entries with the same DBA Name so that, if so, we can attribute their license number to the entries with an undefined one.
missing_found = eat_seat[eat_seat['DBA Name'].isin(missing['DBA Name']) & eat_seat['License #'].notnull() & (eat_seat['License #'] != 0.0)]
missing_found.head()
len(missing_found)
We see that there are 3396 entries in the database that have a License number and correspond to an establishment that at some point has a missing License number.
Among those entries, we will search for the ones matching the DBA Name, Latitude and Longitude of the entries with a missing License number, take the latest entry (the latest inspection), and infer that the missing License number can be replaced by the one of this entry.
missing_found = missing_found.drop(columns = ['Facility Type', 'Risk', 'Address', 'Inspection Type', 'Results', 'Violations', 'FacilityGroup'])
latest = pd.DataFrame(missing_found.groupby(['DBA Name', 'Latitude', 'Longitude'])['Inspection Date'].max())
latest.head()
len(latest)
#DataFrame.merge(self, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
latest_v2 = pd.merge(latest, missing_found, how='inner', on=['Inspection Date', 'DBA Name', 'Latitude', 'Longitude'])
latest_v2.head()
In latest we have one entry per unique triple "DBA Name + Latitude + Longitude", corresponding to the latest inspection date found for this triple in missing_found (reminder: missing_found contains every entry of eat_seat bearing the name of an establishment for which the License number is missing at some point). In latest_v2 we keep every entry corresponding to those "DBA Name + Latitude + Longitude + latest inspection date" combinations appearing in missing_found, in order to investigate the potential duplicates.
if len(latest) < len(latest_v2):
    print('Duplicates detected !')
else:
    print('No Duplicates !')
We see that there are duplicated entries because the length of latest_v2 is greater than the length of latest. We investigate this.
#DataFrame.duplicated(self, subset=None, keep='first')
latest_duplicated = latest_v2[latest_v2.duplicated(subset = ['Inspection Date', 'DBA Name', 'Latitude', 'Longitude'], keep=False)]
latest_duplicated
For Chartwells, the Subway at "41.878003, -87.634299" and Wendy's, those duplicates are not a problem because the License number is the same for each duplicate. But for the other 5 duplicates the License number varies and there is no obvious way to decide which one to keep; those entries will be dropped. For the others, the license number found here will be used to complete the primary database.
latest_duplicated = latest_duplicated.drop(columns = ['Inspection Date', 'Inspection ID'])
tuples = latest_duplicated.drop_duplicates(subset = ['DBA Name', 'Latitude', 'Longitude', 'License #'], keep = False)
tuples = tuples.drop_duplicates(subset = ['DBA Name', 'Latitude', 'Longitude'], keep = 'first')
tuples = tuples.set_index(['DBA Name', 'Latitude', 'Longitude'])
dup = latest_v2[latest_v2.duplicated(subset = ['DBA Name', 'Latitude', 'Longitude', 'License #'], keep='first')]
dup = dup.set_index(['DBA Name', 'Latitude', 'Longitude'])
unique = dup[~dup.index.isin(tuples.index)]
unique
latest_v2 = latest_v2.set_index(['DBA Name', 'Latitude', 'Longitude'])
latest_duplicated = latest_duplicated.set_index(['DBA Name', 'Latitude', 'Longitude'])
latest_v2 = latest_v2[~latest_v2.index.isin(latest_duplicated.index)]
to_fill = latest_v2.append(unique)
to_fill.head()
Now, for every DBA Name, Latitude and Longitude having a missing License number in eat_seat but existing in to_fill, we can replace the missing License number in eat_seat by the one in to_fill. First we replace every null value by 0.0, as we inferred both represent the same thing in the database.
eat_seat['License #'].fillna(0.0, inplace = True)
for i, row in eat_seat.iterrows():
    key = (row['DBA Name'], row['Latitude'], row['Longitude'])
    if float(row['License #']) == 0.0 and key in to_fill.index:
        eat_seat.loc[i, 'License #'] = float(to_fill.loc[key, 'License #'])  # replace by the one in to_fill
len(eat_seat[eat_seat['License #'] == 0.0])
We now delete the rows that still have a missing number, corresponding to those we were not able to infer.
missing = eat_seat[eat_seat['License #'] == 0.0]
eat_seat = eat_seat.drop(missing.index)
eat_seat = eat_seat.rename(columns={"License #": "License"}) #rename the column 'License #' into 'License'
len(eat_seat.License.unique())
def inspection_per_license(dblp):
    return dblp[['License', 'Inspection ID']].groupby('License').count()
ax = inspection_per_license(eat_seat).plot(rot=-45)
ax.set_xlabel('License #')
ax.set_ylabel('#Inspections')
eat_seat.isnull().sum()
We saw earlier that the Community Area column was empty. This is a problem because some of our analyses are based on it. We need to recover the community areas, and this is possible with the Latitude and Longitude.
def inspection_per_latitude(dblp):
    return dblp[['Latitude', 'Inspection ID']].groupby('Latitude').count()
ax = inspection_per_latitude(eat_seat).plot(rot=-45)
ax.set_xlabel('Latitude')
ax.set_ylabel('#Inspections')
def inspection_per_longitude(dblp):
    return dblp[['Longitude', 'Inspection ID']].groupby('Longitude').count()
ax = inspection_per_longitude(eat_seat).plot(rot=-45)
ax.set_xlabel('Longitude')
ax.set_ylabel('#Inspections')
We see that there are entries with the Latitude or Longitude still missing. We create a dataframe containing them.
missing_location = eat_seat[(eat_seat['Longitude'].isnull()) | (eat_seat['Latitude'].isnull())]
len(missing_location)
only_longitude_or_latitude = eat_seat[(eat_seat['Longitude'].isnull()) ^ (eat_seat['Latitude'].isnull())]
len(only_longitude_or_latitude)
We see that there is no entry where only one of the Latitude or the Longitude is missing, so for the following steps we can use either one being missing as the sole condition.
We search in the eat_seat dataframe whether those missing_location entries sometimes appear with the Latitude and Longitude columns filled in.
eat_seat[eat_seat['License'].isin(missing_location['License']) & (eat_seat['Longitude'].notnull())]
We see that there is no entry we can use to infer the missing Latitude and Longitude: the rows with undefined coordinates are dropped.
Long_null = eat_seat[eat_seat['Longitude'].isnull()].index
eat_seat.drop(Long_null, inplace=True)
Lat_null = eat_seat[eat_seat['Latitude'].isnull()].index
eat_seat.drop(Lat_null, inplace=True)
eat_seat['Location'] = eat_seat.apply(lambda x : Point(x['Latitude'], x['Longitude']), axis = 1)
eat_seat.head(3)
https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6 provides a dataset containing the current community area boundaries of Chicago.
It will allow us to properly attribute a Community Area number to each establishment, using its Location point and the Polygon of the Community Area.
We check whether the Location point lies within a Community Area Polygon, and if so, we give the establishment the corresponding Community Area number.
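The point-in-polygon test itself can be illustrated with a minimal shapely example (toy coordinates, not real community areas):

```python
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

# A toy square "community area" and two toy establishment locations
area = Polygon([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)])
inside = Point(0.5, 0.5)
outside = Point(2.0, 2.0)

print(area.contains(inside))   # True
print(area.contains(outside))  # False
```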
commareas = pd.read_csv('CommAreas.csv', error_bad_lines=False)
commareas = commareas.drop(columns = ['PERIMETER', 'AREA', 'COMAREA_', 'COMAREA_ID',\
'COMMUNITY', 'AREA_NUM_1', 'SHAPE_AREA', 'SHAPE_LEN'])
commareas.head(3)
commareas.info()
The clean_polygon function is created to standardize the the_geom column, which needs to be in the same format as the Location column of the eat_seat DataFrame.
def clean_polygon(poly):
    liste = poly.strip('MULTIPOLYGON (((').strip(')))').split(', ')
    list_pairs = []
    for coordinates in liste:
        pair = coordinates.split(' ')
        pair[0] = re.sub("[^0-9.]", "", pair[0])
        pair[1] = re.sub("[^0-9.]", "", pair[1])
        list_pairs.append((float(pair[1]), -float(pair[0])))
    return Polygon(list_pairs)
commareas.the_geom = commareas.the_geom.apply(clean_polygon)
commareas.head(3)
def comarea_attribution(row):
    var = 0.0
    for i, area in commareas.iterrows():
        if area[0].contains(row.loc['Location']):
            var = float(area[1])
            break
    return var
This function checks whether the Location point of each row lies within a community area polygon. If it does, it returns the Community Area number; otherwise, it returns 0.0.
eat_seat['Community Area'] = eat_seat.apply(lambda x : comarea_attribution(x), axis = 1)
eat_seat.head(3)
eat_seat.to_csv('eat_seat.csv')
def facility_per_comarea(dblp):
    return dblp[['Community Area', 'Inspection ID']].groupby('Community Area').count()
ax = facility_per_comarea(eat_seat).plot(rot=-45)
ax.set_xlabel('Community Area')
ax.set_ylabel('#Inspections')
The Food Code Rules changed on July 1, 2018. After investigating those changes, it appears that only the denomination of the violations changed, not the violations themselves, and a few additional violations were added to the list of possible violations. This means those changes do not require extra processing, and the violations can be considered together as a single common list.
len(eat_seat.Violations.unique())
eat_seat.Violations[0]
It seems that every violation entry is unique because it contains not only the violation type but also the comments of the inspectors, so the Violations column has to be parsed.
Every violations cell seems to be structured this way: "number of the violation". "TYPE OF THE VIOLATION" - Comments: "comments of the inspector", repeated as many times as the number of violations detected on the day of the inspection, separated by vertical bars.
We only want to keep the violation number (because we can check online which violation it is). We create a column NumberViolations containing the IDs of the violations found during the corresponding inspection. The rest is not kept, because the titles of the violations can be found online and we do not plan on using the inspectors' comments.
As we can do this cleaning only for the rows where the field Violations is not empty, we will temporarily drop all the other rows.
temp = eat_seat.dropna(subset=['Violations'], axis = 0, how = 'all')
violations = temp.apply(lambda row: re.findall(r'\|\s([0-9]+)[.]', str(row['Violations'])), axis = 1)
first_violations = temp.apply(lambda row: row['Violations'].split('.')[0], axis = 1)
for violation, first_violation in zip(violations, first_violations):
    violation.append(first_violation)
flat_list = [item for sublist in violations for item in sublist]
unique, counts = np.unique(flat_list, return_counts=True)
This gives a list of violation IDs for each row.
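As a quick illustration of this extraction logic, here it is applied to a toy violations string (made-up content following the format described above):

```python
import re

# Made-up violations string, mimicking the format of the Violations column
violations = ('3. MANAGEMENT AWARENESS - Comments: NO CERTIFIED MANAGER ON SITE '
              '| 18. NO EVIDENCE OF RODENTS - Comments: DROPPINGS OBSERVED')

found = re.findall(r'\|\s([0-9]+)[.]', violations)  # numbers preceded by '| '
found.append(violations.split('.')[0])              # the first number has no leading '|'
print(found)  # ['18', '3']
```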
temp = temp.assign(NumberViolations = violations)
temp = temp[['Inspection ID', 'NumberViolations']]
temp.head(3)
len(eat_seat)
Now that we have a dataframe with the inspection ID of every inspection where violations were found and a column containing the list of those violations, we can add it to the primary dataframe.
eat_seat = pd.merge(eat_seat, temp, how='left', on='Inspection ID', left_index=True, right_index=False)
eat_seat = eat_seat.set_index(['Inspection ID']) #redefines the index
eat_seat.head(3)
eat_seat.index.is_unique
len(eat_seat)
eat_seat.isnull().sum()
We see that there are more than 50,000 rows where the Violations column is empty. We have to study whether those cells are empty because there were no violations (meaning the establishment is healthy) or because the inspection was not successful (meaning we can drop the row because it cannot be used in our research).
eat_seat.Results.unique()
We create a new dataframe for each value of the Results column in order to study them.
noentry = eat_seat[eat_seat['Results']=='No Entry']
outofbusiness = eat_seat[eat_seat['Results']=='Out of Business']
notready = eat_seat[eat_seat['Results']=='Not Ready']
businessnotlocated = eat_seat[eat_seat['Results']=='Business Not Located']
passwithconditions = eat_seat[eat_seat['Results']=='Pass w/ Conditions']
passed = eat_seat[eat_seat['Results']=='Pass']
fail = eat_seat[eat_seat['Results']=='Fail']
We investigate how many entries of each result type have an empty Violations column.
results_dic = {'No Entry' : noentry, 'Out of Business' : outofbusiness, 'Not Ready' : notready,\
'Business Not Located' : businessnotlocated, 'Pass With Conditions' : passwithconditions, 'Pass' : passed, 'Fail' : fail}
for name, result in results_dic.items():
    print(name, ':', len(result[result['Violations'].isnull()]), 'empty Violation columns /', len(result),
          'rows =', (len(result[result['Violations'].isnull()]) / len(result)), '\n')
We see that almost every entry where the Result is 'No Entry', 'Out of Business', 'Not Ready' or 'Business Not Located' has an empty Violations field. We can safely drop those rows because they are not relevant for our research.
results = ['Pass', 'Pass w/ Conditions', 'Fail']
eat_seat = eat_seat.loc[eat_seat['Results'].isin(results)]
len(eat_seat)
Now we have to take care of the cases where there are no Violations and the Result is either Pass, Fail or Pass with Conditions.
When the result is Pass and the Violations field is empty, we can put a 0 in the NumberViolations column.
eat_seat['NumberViolations'].fillna(0, inplace=True)
len(eat_seat)
When the result is either Fail or Pass with Conditions but the Violations field is empty, we drop those rows because values are missing. An establishment cannot fail an inspection or receive conditions when no violation is found; those entries make no sense and cannot be taken into account in our research.
results = ['Pass w/ Conditions', 'Fail']
EmptyViolations = eat_seat[eat_seat['NumberViolations'] == 0]
EmptyinResults = EmptyViolations.loc[EmptyViolations['Results'].isin(results)]
indexes = EmptyinResults.index
eat_seat = eat_seat.drop(labels = indexes)
len(eat_seat)
Now we can replace Pass by 1, Pass w/ conditions by 2, and Fail by 3 (that will be useful during the computation of the healthiness score).
eat_seat['Results'].unique()
eat_seat['Results'] = eat_seat['Results'].replace({'Fail':3, 'Pass w/ Conditions':2, 'Pass':1})
eat_seat.head(3)
eat_seat.Results = pd.to_numeric(eat_seat.Results, downcast = 'integer')
eat_seat.Results = eat_seat.Results.apply(nbestethic)
Now we can compute the healthiness score of each inspection by combining the Results score with the number of violations :
#initialize the new column InspectionScore with 0 for every row
eat_seat['InspectionScore'] = 0
#fill in the score for the rows where NumberViolations != 0
for ID in eat_seat.index:
    if eat_seat.at[ID, 'NumberViolations'] != 0:
        eat_seat.at[ID, 'InspectionScore'] = len(eat_seat.at[ID, 'NumberViolations']) * eat_seat.at[ID, 'Results'] ** 3
Here is a short explanation of the formula used to compute the Inspection Score:
We have two parameters, NumberViolations and the Results score given by the inspectors (1: pass, 2: pass with conditions, 3: fail).
NumberViolations is not the main value, because it can introduce a bias: if a restaurant gets only one violation, but a serious one, and a Fail, a plain product would give it a score of 3, while a restaurant with five violations and a 2 would get a score of 10.
To mitigate this problem, we cube the Results. The NumberViolations is therefore multiplied by 1, 8 or 27, which weighs the inspection outcome more heavily than a plain product would.
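Written as a small helper (the function name is ours, for illustration only), the formula gives:

```python
def inspection_score(number_violations, result):
    # result: 1 = Pass, 2 = Pass w/ Conditions, 3 = Fail
    return number_violations * result ** 3

print(inspection_score(1, 3))  # 27: a single violation but a failed inspection
print(inspection_score(5, 2))  # 40: five violations, passed with conditions
print(inspection_score(5, 1))  # 5: five minor findings on a clean pass
```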
eat_seat.head(3)
def ret_year(row):
    return row.loc['Inspection Date'].year
Function that returns the year of the inspection based on a datetime object stored in Inspection Date
r = eat_seat.loc[2352683]
a = ret_year(r)
print(a)
Just a quick test; it seems to be working.
eat_seat['Inspection Year'] = eat_seat.apply(lambda x : ret_year(x), axis = 1)
Applying this function to the df
eat_seat.head(3)
eat_seat.to_csv('eat_seat.csv')
eat_seat = pd.read_csv('eat_seat.csv')
def inspection_per_year(dblp):
return dblp[['Inspection Year','Inspection ID']].groupby('Inspection Year').count()
ax = inspection_per_year(eat_seat).plot(rot=-45)
ax.set_xlabel('Year')
ax.set_ylabel('#Inspections')
This part will only be used for the owner mean score and number of restaurants; for the more global analysis, we keep eat_seat. Here we want to take the median if a restaurant has been inspected more than once in a year, and then group by License to compute a mean of the Inspection Score for each License + DBA Name.
duplicates = eat_seat[eat_seat.duplicated(['License','DBA Name', 'Inspection Year'])]
len(duplicates)
Here we can see that several facilities were inspected more than one time in a year
eat_seat_facility = eat_seat[['License','DBA Name','Community Area','InspectionScore','Inspection Year']].copy()\
.groupby(['License','DBA Name','Community Area','Inspection Year']).median()
Selecting License/DBA Name/Community Area/Inspection Score/Inspection Year, grouping on License/DBA Name/Community Area/Inspection Year and taking the median of the Inspection Score.
We now have only the median Inspection Score for each facility and each year.
eat_seat_facility.head(10)
eat_seat_facility.reset_index()
Now we want to compute the mean for each restaurant (License + DBA Name) to merge it with the restaurant owners.
eat_seat_facility.to_csv('eat_seat_facility.csv')
eat_seat_facility = pd.read_csv('eat_seat_facility.csv')
Storing them
duplicates = eat_seat_facility[eat_seat_facility.duplicated(['License','DBA Name','Inspection Year'])]
len(duplicates)
eat_seat_facility_2 = eat_seat_facility[['License','DBA Name','Community Area','InspectionScore']].copy()\
.groupby(['License','DBA Name','Community Area']).mean()
Here we only group on the License/DBA Name/Community Area to take the mean of the Inspection Score
eat_seat_facility_2.head(3)
eat_seat_facility_2.to_csv('eat_seat_facility_2.csv')
eat_seat_facility_2 = pd.read_csv('eat_seat_facility_2.csv')
Storing it
eat_seat_facility_2.head(3)
Now we can see that the previous duplication problem has disappeared
We found two datasets on Kaggle https://www.kaggle.com/chicago/chicago-business-licenses-and-owners, gathering the business licenses and the business owners of Chicago. It could be interesting to observe the results of different establishments owned by the same person.
The first dataset contains the details about every licensed establishment. There are a lot of columns but the only ones interesting us are :
- the license id column, to have a link with the chicago food inspections dataset
- the account number column, to have a link with the business owners dataset
- the police district column, in case we want to have a link with the crime dataset
licenses = pd.read_csv('business-licenses.csv', sep=',', low_memory = False) #creation of the dataframe
licenses = licenses.rename(str.lower, axis='columns')
licenses = licenses.drop(columns = ['city', 'state', 'id', 'precinct', 'ward precinct', 'business activity id',\
'license number', 'application type', 'application created date',\
'application requirements complete', 'payment date', 'conditional approval',\
'license term start date', 'license term expiration date', 'license approved for issuance',\
'date issued', 'license status', 'license status change date', 'ssa',\
'historical wards 2003-2015', 'zip codes', 'wards', 'census tracts', 'location',\
'license code', 'license description', 'business activity', 'site number',\
'zip code', 'latitude', 'longitude', 'address', 'legal name', 'doing business as name',\
'community areas', 'ward'])
licenses['police district'] = pd.to_numeric(licenses['police district'], downcast = 'integer')
licenses.head(3)
licenses = licenses.set_index('account number')
licenses.head(3)
print(len(licenses.index))
The second dataset contains the details about every license owner. We have decided to keep the following columns :
- the account number column, to have a link with the business licenses dataset
- the owner first name and owner last name columns, in order to create a full name column
owners = pd.read_csv('business-owners.csv',sep=',') #creation of the dataframe
owners = owners.rename(str.lower, axis='columns')
owners = owners.drop(columns = ['suffix', 'legal entity owner', 'owner middle initial', 'legal name', 'title'])
owners.head(3)
owners['full name'] = owners['owner first name'] + ' ' + owners['owner last name']
owners = owners.drop(columns = ['owner first name', 'owner last name'])
A full name column is enough for the needs that we have to link the licenses to the owners.
owners.head(3)
len(owners['account number'].unique())
len(owners['full name'].unique())
Here we can see that the number of accounts is not the same as the number of full names. For now, we will consider that a same account can be shared by several people (for example, it could be the case for partners owning a business together).
owners = pd.DataFrame(owners.groupby('account number')['full name'].apply(list))
owners.head(3)
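A minimal sketch of the groupby + apply(list) pattern used above, with invented names and account numbers:

```python
import pandas as pd

# hypothetical owners: account 1 is shared by two people
toy = pd.DataFrame({'account number': [1, 1, 2],
                    'full name': ['ANN LEE', 'BOB RAY', 'ANN LEE']})
# collapse all names of a same account into one list per account
grouped = pd.DataFrame(toy.groupby('account number')['full name'].apply(list))
# account 1 now holds the list ['ANN LEE', 'BOB RAY']
```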
We can observe that the lists of the full name column contain duplicates and 'nan' values. A function clean_list can be defined to clean them.
def clean_list(liste):
    # keep only unique string values (drops NaN floats and duplicate names)
    cleaned = []
    for element in liste:
        if type(element) == str and element not in cleaned:
            cleaned.append(element)
    return cleaned
owners['full name'] = owners['full name'].apply(clean_list)
owners.head(3)
Both the licenses and owners dataframes are now indexed by account number, so we can merge them together on their indexes.
business = pd.merge(licenses, owners, right_index = True, left_index = True)
business.head(3)
business.reset_index(level=0, inplace=True)
business.head(3)
After renaming the license id column of the business dataframe to License, we can merge it with the eat_seat_facility_2 dataframe on that column.
business = business.rename(columns= {'license id' : 'License'})
eat_seat_2 = pd.merge(eat_seat_facility_2, business, on = ['License'])
eat_seat_2.head(3)
eat_seat_2.to_csv('newfood.csv')
license_owner = eat_seat_2.explode('full name').dropna(how='any', subset=['full name'])
The full name column is filled with the names of the multiple owners of each restaurant. These names are stored in a list, and we want one owner per row to be able to group the data on them; this is what this cell does. We also delete the rows without an owner name.
license_owner.head(3)
Checking that the explode was done correctly
license_owner.isnull().sum()
We can see here that there are no more missing values in the full name column
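A minimal sketch of the explode + dropna step, with invented licenses and owners:

```python
import pandas as pd  # DataFrame.explode requires pandas >= 0.25

# hypothetical rows: License 1 has two owners, License 2 has none
df = pd.DataFrame({'License': [1, 2],
                   'full name': [['ANN LEE', 'BOB RAY'], []]})
# one owner per row; an empty list explodes to NaN and is then dropped
exploded = df.explode('full name').dropna(how='any', subset=['full name'])
# License 1 now occupies one row per owner; License 2 is gone
```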
license_owner_score_number = license_owner[['full name','InspectionScore']].copy().groupby('full name').mean()
Grouping on the full name (owner), then taking the mean of the Inspection Score over all the restaurants that belong to each owner
license_owner_score_number.head(3)
license_owner_score_number['N Restaurants'] = license_owner[['full name','DBA Name']].copy().groupby('full name').count()['DBA Name']
Grouping the DBA Name on the full name (owner), then counting to get the number of restaurants that belong to each owner
license_owner_score_number.head(10)
license_owner_score_number.to_csv('license_owner_score_number.csv')
Storing the dataframe
Eat_Year_Type = pd.DataFrame()
Eat_Year_Type['FacilityGroup'] = ['childrens_services','adultcare','bakery_restaurant','banquet','bar_restaurant','catering','daycare'\
,'golden_diner','grocery_restaurant'\
,'liquor_restaurant','restaurant','rooftop_restaurant','school']
Eat_Year_Type
for x in range(2010,2020) :
df = eat_seat.loc[eat_seat['Inspection Year'] == x][['FacilityGroup','InspectionScore']].copy().groupby('FacilityGroup').mean()
df_2 = pd.merge(Eat_Year_Type, df, how = 'outer', on = ['FacilityGroup'])
Eat_Year_Type[x] = df_2['InspectionScore']
This cell selects the rows of each year, then groups the mean Inspection Score by FacilityGroup
Eat_Year_Type
Eat_Year_Type.to_csv('Eat_Type_Year.csv')
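The per-year outer merge used in the loop above can be sketched as follows; the group names are real FacilityGroup values, but the score is invented:

```python
import pandas as pd

# base frame lists all groups; one year's means may only cover some of them
base = pd.DataFrame({'FacilityGroup': ['bakery_restaurant', 'school']})
year_means = pd.DataFrame({'FacilityGroup': ['school'], 'InspectionScore': [12.0]})
# outer merge keeps every group; missing ones get NaN for that year
aligned = pd.merge(base, year_means, how='outer', on=['FacilityGroup'])
```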
Eat_Year_PriPub = pd.DataFrame()
Eat_Year_PriPub['Public_Private'] = ['Public','Private']
Eat_Year_PriPub
for x in range(2010,2020) :
df = eat_seat.loc[eat_seat['Inspection Year'] == x][['Public_Private','InspectionScore']].copy().groupby('Public_Private').mean()
df_2 = pd.merge(Eat_Year_PriPub, df, how = 'outer', on = ['Public_Private'])
Eat_Year_PriPub[x] = df_2['InspectionScore']
This cell selects the rows of each year, then groups the mean Inspection Score by Public_Private
Eat_Year_PriPub
Eat_Year_PriPub.to_csv('Eat_PriPub_Year.csv')
Storing the data
crime_2008_2011 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', error_bad_lines=False)
crime_2012_2017 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', error_bad_lines=False)
crime = crime_2008_2011.append(crime_2012_2017)
crime.head(3)
Load the data found on https://www.kaggle.com/currie32/crimes-in-chicago in pandas DataFrames and concatenate them.
crime.info()
Taking a quick look at the composition of the Dataframe
crime = crime.drop(['District', 'Ward', 'Unnamed: 0','Case Number','IUCR','Arrest', 'Domestic','FBI Code','X Coordinate','Y Coordinate','Updated On', 'Location Description'], axis=1)
crime.head(3)
The datasets have several columns that are not of interest to us, so they are dropped. In terms of location we decide to only keep the Community Area column in order to have a link with the chicago food inspections dataset. We also keep the Longitude and Latitude to possibly map the crimes for the visualization.
duplicates = crime[crime.duplicated(keep='first')]
len(duplicates)
Taking a look at the duplicates of the crime DataFrame. We see that multiple crimes are stored more than once in the dataset
crime = crime.drop_duplicates(keep='first')
len(crime)
Deleting the duplicates, and taking a look at the DataFrame length after the deletion
for x in crime.columns :
print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))
Here we are looking at the missing values. Most of the missing data are part of the Latitude, Longitude and Location columns.
For our analysis, we need the Community Area value; here 869 Community Area values are missing.
These 869 rows represent only about 0.03% of the 2997929 rows, so we decide to drop them.
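A quick arithmetic check of the proportion quoted above:

```python
# share of rows with a missing Community Area, as a percentage
missing, total = 869, 2997929
share = missing / total * 100
print(round(share, 3))  # about 0.029 %
```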
crime = crime.dropna(how='any', subset=['Community Area'])
Dropping the rows with missing Community Area values
for x in crime.columns :
print(x + ' : ' + str(crime[x].isnull().values.any()) + ' --> ' + str(crime[x].isnull().sum()))
def block_approx(address) :
return address.replace('XX','00')
The Block column contains anonymized addresses ("XX"). By replacing the XX with 00 we can already get an approximate position of where the crimes took place.
crime.Block = crime.Block.apply(block_approx)
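For example, on a hypothetical anonymized Block value (not taken from the dataset), the replacement gives:

```python
# the XX hides the house number; 00 pins it to the start of the block
def block_approx(address):
    return address.replace('XX', '00')

print(block_approx('013XX W MADISON ST'))  # → 01300 W MADISON ST
```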
print(set(crime['Primary Type']))
We print all the Primary Types of crime in order to create a dictionary mapping every crime type to its minimum sentence in years of imprisonment -- more details below.
crime_penalty_dic = {'PROSTITUTION' : 0.1, 'DOMESTIC VIOLENCE' : 0.1, 'MOTOR VEHICLE THEFT' : 3.0, 'ASSAULT' : 0.1, 'OFFENSE INVOLVING CHILDREN' : 0.1,\
'RITUALISM' : 0.1, 'BATTERY' : 0.1, 'NON-CRIMINAL (SUBJECT SPECIFIED)' : 0.1, 'CRIM SEXUAL ASSAULT' : 4.0, 'GAMBLING' : 0.1,\
'PUBLIC INDECENCY' : 0.1, 'OTHER OFFENSE' : 0.1, 'LIQUOR LAW VIOLATION' : 0.1, 'OTHER NARCOTIC VIOLATION' : 0.1, 'OBSCENITY' : 0.1,\
'NON-CRIMINAL' : 0.1, 'KIDNAPPING' : 3.0, 'HOMICIDE' : 20.0, 'NARCOTICS' : 0.1, 'ARSON' : 6.0, 'DECEPTIVE PRACTICE' : 0.1, 'ROBBERY' : 3.0,\
'BURGLARY' : 3.0, 'NON - CRIMINAL' : 0.1, 'INTIMIDATION' : 2.0, 'HUMAN TRAFFICKING' : 4.0, 'SEX OFFENSE' : 4.0, 'CRIMINAL TRESPASS' : 0.1,\
'CONCEALED CARRY LICENSE VIOLATION' : 2.0, 'CRIMINAL DAMAGE' : 1.0, 'INTERFERENCE WITH PUBLIC OFFICER' : 0.1, 'PUBLIC PEACE VIOLATION' : 0.1,\
'WEAPONS VIOLATION' : 0.1, 'THEFT' : 0.1, 'STALKING' : 0.1}
To build the crime_penalty dictionary, we checked the minimum prison penalty (in years) for each crime type. For crimes where the penalty is less than 1 year, the value is fixed at 0.1 when no prison time is required, and at the sentence expressed in years (6 months = 0.5) when a minimum penalty is given in months.
These values are taken from the Illinois Penalty Code.
crime_penalty = pd.DataFrame(data = crime_penalty_dic.values(), index=crime_penalty_dic.keys(), columns = ['Crime Score'])
crime_penalty.head(3)
crime_penalty.index.name = 'Primary Type'
crime = pd.merge(crime, crime_penalty, on = ['Primary Type'])
crime.head(3)
Merging the crime and crime_penalty DataFrames to get the Crime Score for each row
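A minimal sketch of this merge, using two of the real crime types and their scores on invented crime rows:

```python
import pandas as pd

# two real entries of the penalty table
penalty = pd.DataFrame({'Primary Type': ['THEFT', 'ROBBERY'],
                        'Crime Score': [0.1, 3.0]})
# three made-up crime records
crimes = pd.DataFrame({'Primary Type': ['THEFT', 'ROBBERY', 'THEFT']})
# many-to-one merge: every crime row now carries its score, ready to be summed
merged = pd.merge(crimes, penalty, on=['Primary Type'])
```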
crime.to_csv('crime.csv')
Here we generate two DataFrames : Inspection Score per Year per Community Area and Crime Score per Year per Community Area
Crime_Year_CommArea = pd.DataFrame()
Crime_Year_CommArea['Community Area'] = np.arange(0.0,78.0)
Creating a new DataFrame with a row for each of the 77 Community Areas of Chicago (np.arange(0.0, 78.0) also includes an unused area 0)
for x in range(2010,2018) :
df = crime.loc[crime['Year'] == x][['Community Area','Crime Score']].copy().groupby('Community Area').sum()
Crime_Year_CommArea[x] = df['Crime Score']
Selecting the Community Area and Crime Score of each year, then grouping by Community Area to compute the sum of the Crime Score per Community Area per year
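The column assignment in this loop relies on pandas index alignment between the groupby result and the target frame; a toy sketch with invented sums:

```python
import pandas as pd
import numpy as np

# target frame indexed 0..3, mimicking the Community Area frame
target = pd.DataFrame({'Community Area': np.arange(0.0, 4.0)})
# a groupby-style result: only areas 1 and 3 had crimes that year
sums = pd.Series([5.0, 2.0], index=[1, 3], name='Crime Score')
# assignment aligns on the index: areas 0 and 2 get NaN
target[2010] = sums
```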
Crime_Year_CommArea
Eat_Year_CommArea = pd.DataFrame()
Eat_Year_CommArea['Community Area'] = np.arange(0.0,78.0)
Creating a new DataFrame with a row for each of the 77 Community Areas of Chicago (np.arange(0.0, 78.0) also includes an unused area 0)
for x in range(2010,2020) :
df = eat_seat.loc[eat_seat['Inspection Year'] == x][['Community Area','InspectionScore']].copy().groupby('Community Area').mean()
Eat_Year_CommArea[x] = df['InspectionScore']
Selecting the Community Area and Inspection Score of each year, then grouping by Community Area to compute the mean of the InspectionScore per Community Area per year
Eat_Year_CommArea
Eat_Year_CommArea.to_csv('Eat_Year_CommArea.csv')
Crime_Year_CommArea.to_csv('Crime_Year_CommArea.csv')
These two DataFrames will be useful to compute the correlations, and for the different visualisations
eat_seat_facility_2 = pd.read_csv('eat_seat_facility_2.csv')
Here we load the eat_seat_facility_2 DataFrame
eat_seat_facility_2.head(3)
eat_seat_facility_3 = eat_seat_facility_2[['License','Community Area']].copy().groupby('Community Area').count()
Here we group by Community Area and count the License column to compute the number of restaurants in each Community Area
eat_seat_facility_3 = eat_seat_facility_3.rename(columns={'License' : 'N of restaurants'})
eat_seat_facility_3.to_csv('eat_seat_facility_3.csv')
Storing the DataFrame
Now that we have computed the crime score and the hygiene score for every community area in Chicago, we want to do some visualization with maps. For this, we will use Folium. It allows us to create a map of Chicago and draw the boundaries of all the community areas. For this we use the GeoJSON file found on the Chicago Data Portal https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Community-Areas-current-/cauq-8yn6 (corresponding to the csv file used earlier to define the boundaries of the community areas). Then we use color scales to show where there is more crime and where the facilities are less hygienic.
#opening the dataframe saved in the precedent part
eat_year_commareas = pd.read_csv('Eat_Year_CommArea.csv',sep=',')
eat_year_commareas.head(3)
#getting rid of decimal in Community Area and turning them into string type
eat_year_commareas['Community Area'] = eat_year_commareas['Community Area'].astype('int')
eat_year_commareas['Community Area'] = eat_year_commareas['Community Area'].astype('str')
eat_year_commareas.head(3)
#geojson file containing the boundaries for each community area in Chicago
comareas_geo = r'Boundaries.geojson'
#creation of the map centered on Chicago city
eat_map_2010 = folium.Map(location=[41.85, -87.55],
tiles='Stamen Terrain',
zoom_start=9.5,
)
#adding to the map the boundaries and the hygiene scores of 2010
folium.Choropleth(name='Hygiene scores in 2010',
geo_data = comareas_geo,
data = eat_year_commareas,
columns = ['Community Area', '2010'],
key_on = 'feature.properties.area_num_1',
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.2,
threshold_scale = [0, 20, 40, 60, 80, 100, 120],
legend_name = 'Hygiene score per community area in 2010'
).add_to(eat_map_2010)
eat_map_2010.save('eat_map_2010.html')
eat_map_2010
#same as above for the years 2011 to 2017, factored into a single loop
from IPython.display import display
for year in ['2011', '2012', '2013', '2014', '2015', '2016', '2017'] :
    eat_map = folium.Map(location=[41.85, -87.55],
                         tiles='Stamen Terrain',
                         zoom_start=9.5,
                         )
    folium.Choropleth(name='Hygiene scores in ' + year,
                      geo_data = comareas_geo,
                      data = eat_year_commareas,
                      columns = ['Community Area', year],
                      key_on = 'feature.properties.area_num_1',
                      fill_color = 'YlOrRd',
                      fill_opacity = 0.7,
                      line_opacity = 0.2,
                      threshold_scale = [0, 20, 40, 60, 80, 100, 120],
                      legend_name = 'Hygiene score per community area in ' + year
                      ).add_to(eat_map)
    eat_map.save('eat_map_' + year + '.html')
    display(eat_map) #render each map inline, as the per-year cells did
As we can see from the maps, there is no particular trend, neither a generalized increase nor decrease.
We can see that only a few community areas went over a hygiene score of a hundred : Burnside (47) in 2013 and Washington Park (40) and Woodlawn (42) in 2015.
We could assume that they are the worst community areas to eat in, along with some others that stood out with a hygiene score above eighty : Brighton Park (58) in 2010, Austin (25) in 2012, South Deering (51) in 2014, East Garfield Park (27) and Hyde Park (41) in 2015, West Garfield Park (26) in 2016 and Hermosa (20) in 2017.
The com_eat DataFrame gathers the mean of the HygieneScores of all the inspections for each Community Area per Year.
com_eat = pd.read_csv('Eat_Year_CommArea.csv',sep=',') #creation of the dataframe
com_eat['Community Area'] = com_eat['Community Area'].apply(str)
com_eat = com_eat.drop(0)
com_eat = com_eat.drop(columns='Unnamed: 0')
com_eat.head()
com_eat['Median'] = com_eat.drop(columns='Community Area').median(axis=1)
com_eat.head()
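The row-wise median computed above can be sketched on a toy row with made-up yearly scores:

```python
import pandas as pd

# one community area with three invented yearly scores
df = pd.DataFrame({'Community Area': ['1'], '2010': [40.0], '2011': [50.0], '2012': [90.0]})
# axis=1 takes the median across the year columns, per row
df['Median'] = df.drop(columns='Community Area').median(axis=1)
```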
The following figure allows to visualize the HygieneScores per Year for each Community Area. It is possible to unselect the years we don't want to display by clicking on their labels on the right of the figure.
import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
commarea = com_eat['Community Area']
fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
x = str(x)
fig.add_trace(go.Bar(x=commarea, y=com_eat[x], name=x, marker_color=colorlist[c]))
c += 1
fig.add_trace(go.Scatter(x=commarea, y=com_eat['Median'], name='Median', marker_color='black'))
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(title_text='Hygiene Score per Community Area')
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_Bars.html', auto_open=True)
The figure allows to see in detail the HygieneScores of each Community Area. The Median line helps to spot the bars that tend to deviate from the computed Median.
What we can see is that there is no particular trend, with a Median oscillating between about 30 and 60.
The entry which really stands out from the rest is Community Area 47, with a Median around 4 : except for its 2013 spike, it has very low HygieneScores. According to http://www.encyclopedia.chicagohistory.org/pages/194.html, the 47th Community Area of Chicago, called Burnside, is situated on the border of the city and is a "comfortable residential community".
commarea = com_eat['Community Area']
fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
x = str(x)
fig.add_trace(go.Scatter(x=commarea, y=com_eat[x], name=x, marker_color=colorlist[c]))
c += 1
#fig.add_trace(go.Scatter(mode='markers', x=commarea, y=com_eat['Median'], name='Median', marker_color='black'))
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(title_text='Hygiene Score per Community Area')
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_Scatter.html', auto_open=True)
This figure has been computed to get another point of view on the HygieneScores. The same observations as before can be made, which confirms the absence of a particular trend.
The following figure allows to visualize the results in the ascending order - from top to bottom : the top Community Areas have the best HygieneScore (the lowest).
com_eat2010 = com_eat.sort_values(by=['2010'], ascending=False)
commarea = com_eat2010['Community Area']
fig = go.Figure(go.Bar(x=com_eat2010['2010'], y=commarea, orientation='h'))
fig.update_layout(title_text='Hygiene Score per Community Area in 2010', barmode='stack',\
yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2010.html', auto_open=True)
com_eat2011 = com_eat.sort_values(by=['2011'], ascending=False)
commarea = com_eat2011['Community Area']
fig = go.Figure(go.Bar(x=com_eat2011['2011'], y=commarea, orientation='h'))
fig.update_layout(title_text='Hygiene Score per Community Area in 2011', barmode='stack',\
yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2011.html', auto_open=True)
Again, the two figures here show that there is no particular trend. There is no need to display every year : neither the top 10 nor the bottom 10 are alike, except for the 47th Community Area which stays in first place.
For the 47th Community Area, https://en.wikipedia.org/wiki/Burnside,_Chicago#cite_note-cmap-1 explains that it experienced the White Flight phenomenon (https://en.wikipedia.org/wiki/White_flight), leading businesses to move away. The low HygieneScores could either be explained by a smaller number of establishments, or, as said before, by the comfort of the zone, which could ease the maintenance of the establishments.
commarea = com_eat['Community Area']
fig = go.Figure()
colorlist = ('firebrick', 'indianred', 'lightsalmon', 'darkorchid', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
c = 0
for x in range(2010, 2018) :
x = str(x)
fig.add_trace(go.Bar(x=com_eat[x], y=com_eat['Community Area'], orientation='h', name=x, marker_color=colorlist[c]))
c += 1
fig.update_layout(title_text='Hygiene Score per Community Area', barmode='stack',\
yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='HygieneScore'))
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_ComArea_2.html', auto_open=True)
The corr function gives the Pearson coefficients between the yearly HygieneScore columns, computed across the Community Areas.
heat = com_eat.drop(columns='Median').corr()
heat
fig = go.Figure(data=go.Heatmap(x = heat.index, y = heat.index, z=heat))
fig.update_layout(title = 'Pearson Coefficient HygieneScore - Community Area')
fig.show()
plotly.offline.plot(fig, filename = 'Corr_eat.html', auto_open=True)
Considering the constant variations in the food domain, the results obtained could simply indicate that the inspections are fair, following the unpredictable changes in behavior of the restaurant industry.
The types DataFrame gathers the result of the inspections for each FacilityGroup.
types = pd.read_csv('Eat_Type_Year.csv',sep=',') #creation of the dataframe
types = types.drop(0)
types = types.drop(columns='Unnamed: 0')
To begin, the status of the establishment (Public or Private) is added to the DataFrame so that the difference between the two can be clearly visualized.
def sep(row):
if row['FacilityGroup'] in ['bakery_restaurant', 'grocery_restaurant', 'bar_restaurant', 'banquet',\
'catering', 'restaurant', 'golden_diner', 'rooftop_restaurant', 'liquor_restaurant'] :
return 'Public'
if row['FacilityGroup'] in ['adultcare', 'childrens_services', 'daycare', 'school'] :
return 'Private'
types['Stat'] = types.apply(sep, axis=1)
types
Then two DataFrames, private and public, are created by slicing the types DataFrame. This way, they can be put on the same figure while staying well separated.
private = types.loc[types['Stat'] == 'Private']
private = pd.DataFrame(private)
private['Median'] = private.median(axis=1)
public = types.loc[types['Stat'] == 'Public']
public = pd.DataFrame(public)
public['Median'] = public.median(axis=1)
fig = go.Figure()
colorlist1 = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate', 'lightgoldenrodyellow', 'lightgray')
colorlist2 = ('lightgrey', 'lightgreen', 'magenta', 'maroon', 'mediumaquamarine', 'mediumorchid', 'mediumpurple', 'purple', 'red', 'rosybrown')
c = 0
for x in range(2010, 2020) :
x = str(x)
fig.add_trace(go.Bar(x=private.FacilityGroup, y=private[x], name='Private'+x, marker_color=colorlist1[c]))
fig.add_trace(go.Bar(x=public.FacilityGroup, y=public[x], name='Public'+x, marker_color=colorlist2[c]))
c += 1
# Here we modify the tickangle of the xaxis, resulting in rotated labels.
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(title_text='HygieneScore per Facility Type', xaxis=dict(title='Facility Type'), yaxis=dict(title='HygieneScore'))
fig.add_trace(go.Scatter(x=private.FacilityGroup, y=private['Median'], name='Median', marker_color='black', mode='lines'))
fig.add_trace(go.Scatter(x=public.FacilityGroup, y=public['Median'], name='Median', marker_color='black', mode='lines'))
fig.show()
plotly.offline.plot(fig, filename = 'HygieneScore_FacilityType.html', auto_open=True)
On this figure, we can study the differences between Public and Private establishments.
The same absence of trend can be observed for the Public establishments, except for the Restaurant facility type, which shows no important variation until 2018 and then, for 2019, a huge rise of the HygieneScores. The Restaurant facility type is the one with the most entries in the dataset.
For the Private establishments, we can say that the HygieneScores are more stable. They seem to follow the same trend, with a rise in 2018 and 2019.
The Private establishments are the most sensitive : children and the elderly are in their care, they are often expensive, and they generally have to follow specific rules. This particular duty could explain why their results are more constant than those of the Public establishments.
The business DataFrame gathers the Owner Name, the Number of Restaurants per Owner and the HygieneScores.
business = pd.read_csv('license_owner_score_number.csv',sep=',') #creation of the dataframe
business.head()
business = business[['InspectionScore','N Restaurants']].copy().groupby('N Restaurants').mean()
business = business.reset_index()
The corr function gives the Pearson coefficient between the HygieneScores and the Number of Restaurants per Owner.
business['InspectionScore'].corr(business['N Restaurants'])
business.corr()
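A minimal sketch of what Series.corr computes (Pearson by default), with invented values chosen so that the result is exactly zero:

```python
import pandas as pd

# made-up scores and restaurant counts for four owners
scores = pd.Series([40.0, 42.0, 39.0, 41.0])
counts = pd.Series([1, 2, 3, 4])
# Pearson coefficient: covariance divided by the product of std deviations
r = scores.corr(counts)
# here the covariance is exactly zero, so r is 0 (no linear relation)
```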
The HygieneScores and the Number of Restaurants per Owner are not correlated according to the Pearson method.
The result obtained is contrary to our expectation : we would have thought that the more establishments an owner has, the more able he is to enforce rules fitting the Food Code - owning several establishments would bring more experience and resources. Apparently, this is not the case !
We do exactly the same as before with the hygiene scores, using the dataframe that contains the crime score per community area per year.
#opening the dataframe saved in the precedent part
crime_year_commareas = pd.read_csv('Crime_Year_CommArea.csv',sep=',')
crime_year_commareas.head(3)
#getting rid of decimal in Community Area and turning them into string type
crime_year_commareas['Community Area'] = crime_year_commareas['Community Area'].astype('int')
crime_year_commareas['Community Area'] = crime_year_commareas['Community Area'].astype('str')
crime_year_commareas.head(3)
#creation of the map centered on Chicago city
crime_map_2010 = folium.Map(location=[41.85, -87.55],
tiles='Stamen Terrain',
zoom_start=9.5,
)
#adding to the map the boundaries and the crime scores of 2010
folium.Choropleth(name='Crime scores in 2010',
geo_data = comareas_geo,
data = crime_year_commareas,
columns = ['Community Area', '2010'],
key_on = 'feature.properties.area_num_1',
fill_color = 'YlOrRd',
fill_opacity = 0.7,
line_opacity = 0.2,
threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
legend_name = 'Crime score per community area in 2010'
).add_to(crime_map_2010)
crime_map_2010.save('crime_map_2010.html')
crime_map_2010
#same as above for the years 2011 to 2017, factored into a single loop
from IPython.display import display
for year in ['2011', '2012', '2013', '2014', '2015', '2016', '2017'] :
    crime_map = folium.Map(location=[41.85, -87.55],
                           tiles='Stamen Terrain',
                           zoom_start=9.5,
                           )
    folium.Choropleth(name='Crime scores in ' + year,
                      geo_data = comareas_geo,
                      data = crime_year_commareas,
                      columns = ['Community Area', year],
                      key_on = 'feature.properties.area_num_1',
                      fill_color = 'YlOrRd',
                      fill_opacity = 0.7,
                      line_opacity = 0.2,
                      threshold_scale = [0, 2000, 4000, 6000, 8000, 10000, 12000, 14000, 16000],
                      legend_name = 'Crime score per community area in ' + year
                      ).add_to(crime_map)
    crime_map.save('crime_map_' + year + '.html')
    display(crime_map) #render each map inline, as the per-year cells did
As the maps show, the crime score stays more or less stable, with a slight decrease from 2010 to 2015. In 2017, very low crime scores appear everywhere; since we found nothing that could explain this, we suspect an issue in the dataset, possibly incomplete data for that year.
We can also see that one community area stands out every year (except 2017): Austin. Looking into it, we find that it is one of the most populated community areas and has a high number of violent crimes. That explains its high crime score, since crime scores are computed from the penalty (severity) of each crime.
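Penalty-weighted scoring of this kind could be sketched as follows; the PENALTY table, the default weight, and the column names ('Primary Type', 'Community Area') are illustrative assumptions, not the actual values used in this notebook:

```python
import pandas as pd

# Hypothetical severity weights per crime type (illustrative values only)
PENALTY = {'THEFT': 1, 'BATTERY': 3, 'ROBBERY': 4, 'HOMICIDE': 10}

def crime_score(crimes: pd.DataFrame) -> pd.Series:
    """Sum the penalty weights of all crimes, grouped by community area."""
    weighted = crimes['Primary Type'].map(PENALTY).fillna(1)  # default weight 1 for unlisted types
    return weighted.groupby(crimes['Community Area']).sum()

crimes = pd.DataFrame({
    'Community Area': [25, 25, 8],
    'Primary Type': ['HOMICIDE', 'THEFT', 'BATTERY'],
})
crime_score(crimes)  # area 25 -> 11, area 8 -> 3
```

A populous area with many violent crimes accumulates a large score under such a scheme even when its per-capita rate is moderate, which matches the Austin observation.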
The crime DataFrame gathers the median of the CrimeScores of all the inspections for each Community Area, per year.
crime = pd.read_csv('Crime_Year_CommArea.csv',sep=',') #creation of the dataframe
crime = crime.drop(0)
crime = crime.drop(columns='Unnamed: 0')
crime.head()
crime['Community Area'] = crime['Community Area'].apply(str)
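For reference, a table shaped like Crime_Year_CommArea.csv (one row per Community Area, one column per year) could be derived from per-inspection records with a groupby/unstack; the input schema below is an assumption for illustration:

```python
import pandas as pd

# Hypothetical per-inspection records carrying a CrimeScore (assumed schema)
inspections = pd.DataFrame({
    'Community Area': ['8', '8', '25', '25'],
    'Year': [2010, 2011, 2010, 2010],
    'CrimeScore': [100.0, 120.0, 300.0, 500.0],
})

# Median CrimeScore per Community Area and year, with years as columns
crime_table = (inspections
               .groupby(['Community Area', 'Year'])['CrimeScore']
               .median()
               .unstack('Year'))
```

Missing (area, year) combinations come out as NaN, which makes gaps in the data visible immediately.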
The following figure visualizes the CrimeScores per year for each Community Area. A year can be hidden by clicking its label in the legend on the right of the figure.
commarea = crime['Community Area']
fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
for c, year in enumerate(range(2010, 2018)):
    year = str(year)
    fig.add_trace(go.Bar(x=commarea, y=crime[year], name=year, marker_color=colorlist[c]))
# Rotate the x-axis labels and group the bars by year
fig.update_layout(barmode='group', xaxis_tickangle=-45)
fig.update_layout(title_text='Crime Score per Community Area')
fig.show()
plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_Bars.html', auto_open=True)
This figure shows the CrimeScores of each Community Area in detail.
We can already see that the CrimeScores differ widely between Community Areas, but stay within the same range within a given Community Area. Overall, the CrimeScores decreased between 2010 and 2015, then rose in 2016.
This rise cannot be explained by our analysis of the Chicago Food Inspections, but it has already been widely discussed. A quick Google search turns up many articles reporting this trend:
"The city's overall crime rate, especially the violent crime rate, is higher than the US average. Chicago was responsible for nearly half of 2016's increase in homicides in the US, though the nation's crime rates remain near historic lows. The reasons for the higher numbers in Chicago remain unclear." - https://en.wikipedia.org/wiki/Crime_in_Chicago
Apparently, the reasons remain unclear to others as well.
The second observation is that the numbers for 2017 are very low. Such a large gap with the rest of the data is not normal, which again suggests a problem with the file used for that year.
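A suspiciously incomplete year like 2017 can also be flagged programmatically by comparing each year's total against the median of all years; the 0.5 cutoff and the toy totals below are illustrative assumptions:

```python
import pandas as pd

def flag_incomplete_years(totals: pd.Series, ratio: float = 0.5) -> list:
    """Return years whose total falls below `ratio` times the median of all years."""
    cutoff = totals.median() * ratio
    return totals[totals < cutoff].index.tolist()

# Yearly totals shaped like the notebook's data: 2017 is far below the rest
totals = pd.Series({2010: 9000, 2011: 8800, 2012: 8500, 2017: 900})
flag_incomplete_years(totals)  # -> [2017]
```

Using the median rather than the mean keeps the cutoff stable even when one year is badly truncated.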
commarea = crime['Community Area']
fig = go.Figure()
colorlist = ('indianred', 'lightsalmon', 'darkorchid', 'firebrick', 'cornsilk', 'saddlebrown', 'lavenderblush', 'chocolate')
for c, year in enumerate(range(2010, 2018)):
    year = str(year)
    fig.add_trace(go.Scatter(x=commarea, y=crime[year], name=year, marker_color=colorlist[c]))
# Rotate the x-axis labels (barmode has no effect on scatter traces, so it is dropped here)
fig.update_layout(xaxis_tickangle=-45)
fig.update_layout(title_text='Crime Score per Community Area')
fig.show()
plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_Scatter.html', auto_open=True)
This view makes the general trend easy to see: the CrimeScores of the different years follow the same profile across Community Areas.
The following figure shows the results in descending order: the Community Areas at the top have the best (lowest) CrimeScores.
crime2010 = crime.sort_values(by=['2010'], ascending=False)
commarea = crime2010['Community Area']
fig = go.Figure(go.Bar(x=crime2010['2010'], y=commarea, orientation='h'))
fig.update_layout(title_text='Crime Score per Community Area in 2010', barmode='stack',\
yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='CrimeScore'))
fig.show()
plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_2010.html', auto_open=True)
crime2011 = crime.sort_values(by=['2011'], ascending=False)
commarea = crime2011['Community Area']
fig = go.Figure(go.Bar(x=crime2011['2011'], y=commarea, orientation='h'))
fig.update_layout(title_text='Crime Score per Community Area in 2011', barmode='stack',\
yaxis=dict(type='category', title='Community Area', tickmode='linear'), xaxis=dict(title='CrimeScore'))
fig.show()
plotly.offline.plot(fig, filename = 'CrimeScore_ComArea_2011.html', auto_open=True)
This time, the two figures show the same trends: the same Community Areas appear at the top, though not in exactly the same order.
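The same ranking can also be read off directly with pandas `nsmallest` instead of the chart; the small demo frame below mirrors the layout of the `crime` DataFrame, with made-up values:

```python
import pandas as pd

crime_demo = pd.DataFrame({
    'Community Area': ['8', '25', '32', '76'],
    '2010': [5200.0, 15800.0, 6100.0, 1200.0],
    '2011': [5000.0, 15500.0, 5900.0, 1100.0],
})

# Lowest (best) CrimeScores first for a given year
safest_2010 = crime_demo.nsmallest(2, '2010')[['Community Area', '2010']]
```

Repeating the call for each year's column gives a quick textual version of the per-year bar charts.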
The corr function gives the Pearson correlation coefficients between the yearly CrimeScore columns (2017 is dropped because of its incomplete data).
heat = crime.drop(columns = '2017').corr()
heat
fig = go.Figure(data=go.Heatmap(x = heat.index, y = heat.index, z=heat))
fig.update_layout(width = 700, height = 700, autosize = False, title = 'Pearson Coefficient between yearly CrimeScores')
fig.show()
plotly.offline.plot(fig, filename = 'Corr_crime.html', auto_open=True)
for year in range(2010, 2018):
    year = str(year)
    # Create figure with secondary y-axis
    fig = make_subplots(specs=[[{"secondary_y": True}]])
    # Add traces
    fig.add_trace(go.Scatter(x=crime['Community Area'], y=crime[year], name="CRIME " + year), secondary_y=False)
    fig.add_trace(go.Scatter(x=com_eat['Community Area'], y=com_eat[year], name="HYGIENE " + year), secondary_y=True)
    # Add figure title
    fig.update_layout(title_text="Crime and Hygiene Scores " + year)
    # Set x-axis title
    fig.update_xaxes(title_text="Community Area")
    # Set y-axes titles
    fig.update_yaxes(title_text="<b>CrimeScore</b>", secondary_y=False)
    fig.update_yaxes(title_text="<b>HygieneScore</b>", secondary_y=True)
    fig.show()
    plotly.offline.plot(fig, filename = 'Double_' + year + '.html', auto_open=True)
It is hard to draw any firm conclusion from these plots yet; another calculation could be useful.
The eat_crime DataFrame gathers the CrimeScores and HygieneScores of all the inspections for each Community Area.
eat_crime = pd.read_csv('Crime_Eat_CommArea.csv',sep=',') #creation of the dataframe
eat_crime = eat_crime.drop(0)
eat_crime.head()
The corr function gives the Pearson correlation coefficient between the HygieneScores and the CrimeScores.
eat_crime['InspectionScore'].corr(eat_crime['Crime Score'])
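To help interpret this coefficient: Pearson's r lies in [-1, 1], with negative values meaning the two scores move in opposite directions. A minimal check on synthetic data, using column names matching the call above:

```python
import pandas as pd

# Two perfectly anti-correlated columns (synthetic values for illustration)
toy = pd.DataFrame({
    'InspectionScore': [1.0, 2.0, 3.0, 4.0],
    'Crime Score': [8.0, 6.0, 4.0, 2.0],
})
r = toy['InspectionScore'].corr(toy['Crime Score'])  # Series.corr defaults to Pearson
# r is -1.0 here: higher InspectionScore goes with lower Crime Score
```

A coefficient near zero on the real data would mean hygiene and crime vary independently across inspections.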